555 research outputs found
Exploring dependence between categorical variables: benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms
This manuscript is concerned with relating two approaches that can be used to
explore complex dependence structures between categorical variables, namely
Bayesian partitioning of the covariate space incorporating a variable selection
procedure that highlights the covariates that drive the clustering, and
log-linear modelling with interaction terms. We derive theoretical results on
this relation and discuss whether they can be employed to assist log-linear model
determination, demonstrating advantages and limitations with simulated and real
data sets. The main advantage concerns sparse contingency tables. Inferences
from clustering can potentially reduce the number of covariates considered and,
subsequently, the number of competing log-linear models, making the exploration
of the model space feasible. Variable selection within clustering can inform on
marginal independence in general, thus allowing for a more efficient
exploration of the log-linear model space. However, we show that the clustering
structure is not informative on the existence of interactions in a consistent
manner. This work is of interest to those who utilize log-linear models, as
well as practitioners such as epidemiologists who use clustering models to
reduce the dimensionality of the data and to reveal interesting patterns in how
covariates combine. Comment: Preprint
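The link between independence and interaction terms mentioned in this abstract can be illustrated with a minimal sketch. It assumes a two-way table with strictly positive counts and sum-to-zero constraints, and it is not the Bayesian clustering procedure of the paper; the function name and both tables are hypothetical.

```python
import math

def interaction_terms(table):
    """Interaction terms of the saturated log-linear model for a two-way
    table under sum-to-zero constraints (double-centred log counts).
    Assumes every cell count is strictly positive."""
    logs = [[math.log(count) for count in row] for row in table]
    n_rows, n_cols = len(logs), len(logs[0])
    row_means = [sum(row) / n_cols for row in logs]
    col_means = [sum(logs[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    grand_mean = sum(row_means) / n_rows
    return [[logs[i][j] - row_means[i] - col_means[j] + grand_mean
             for j in range(n_cols)]
            for i in range(n_rows)]

# Hypothetical tables: rows of the first are proportional (independence),
# the second concentrates counts on the diagonal (dependence).
independent = [[20, 30], [40, 60]]
dependent = [[40, 10], [10, 40]]

ind_terms = interaction_terms(independent)
dep_terms = interaction_terms(dependent)
print(ind_terms)
print(dep_terms)
```

When the two variables are independent, every interaction term vanishes up to rounding error. Sparse tables with zero cells, the setting the abstract highlights, would need a different treatment because the logarithm is undefined there.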
Bayesian nonparametric models for spatially indexed data of mixed type
We develop Bayesian nonparametric models for spatially indexed data of mixed
type. Our work is motivated by challenges that occur in environmental
epidemiology, where the usual presence of several confounding variables that
exhibit complex interactions and high correlations makes it difficult to
estimate and understand the effects of risk factors on health outcomes of
interest. The modeling approach we adopt assumes that responses and confounding
variables are manifestations of continuous latent variables, and uses
multivariate Gaussians to jointly model these. Responses and confounding
variables are not treated equally: only the relevant parameters of the response
distributions are modeled in terms of explanatory variables or risk
factors. Spatial dependence is introduced by allowing the weights of the
nonparametric process priors to be location specific, obtained as probit
transformations of Gaussian Markov random fields. Confounding variables and
spatial configuration have a similar role in the model, in that they only
influence, along with the responses, the allocation probabilities of the areas
into the mixture components, thereby allowing for flexible adjustment of the
effects of observed confounders, while allowing for the possibility of residual
spatial structure, possibly occurring due to unmeasured or undiscovered
spatially varying factors. Aspects of the model are illustrated in simulation
studies and an application to a real data set.
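One way to picture location-specific weights obtained through a probit transformation is a stick-breaking construction, sketched below. The stick-breaking form is an assumption, since the abstract does not state how the transformed values are turned into weights, and the latent values stand in for draws from a Gaussian Markov random field that is not simulated here. All names are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal CDF, i.e. the inverse of the probit link."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def stick_breaking_weights(latent_values):
    """Map latent Gaussian values for one area to mixture weights.
    Each value becomes a break proportion through the normal CDF, and the
    final component takes whatever is left so the weights sum to one."""
    weights, remaining = [], 1.0
    for eta in latent_values:
        proportion = normal_cdf(eta)
        weights.append(remaining * proportion)
        remaining *= 1.0 - proportion
    weights.append(remaining)
    return weights

# Hypothetical latent field values at two areas (assumed, not estimated).
area_a = stick_breaking_weights([0.5, -0.2, 0.1])
area_b = stick_breaking_weights([-1.0, 0.8, 0.1])
print(area_a)
print(area_b)
```

Because the latent values differ between areas, so do the allocation probabilities, which is how spatial structure can enter a mixture model of this kind.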
Notes to Robert et al.: Model criticism informs model choice and model comparison
In their letter to PNAS and a comprehensive set of notes on arXiv
[arXiv:0909.5673v2], Christian Robert, Kerrie Mengersen and Carla Chen (RMC)
represent our approach to model criticism in situations when the likelihood
cannot be computed as a way to "contrast several models with each other". In
addition, RMC argue that model assessment with Approximate Bayesian Computation
under model uncertainty (ABCmu) is unduly challenging and question its Bayesian
foundations. We disagree, and clarify that ABCmu is a probabilistically sound
and powerful tool for criticizing a model against aspects of the observed data,
and discuss further the utility of ABCmu. Comment: Reply to [arXiv:0909.5673v2]
Statistical tools for synthesizing lists of differentially expressed features in related experiments
A novel approach for finding a list of features that are commonly perturbed in two or more experiments, quantifying the evidence of dependence between the experiments by a ratio
StabJGL: a stability approach to sparsity and similarity selection in multiple network reconstruction
In recent years, network models have gained prominence for their ability to
capture complex associations. In statistical omics, networks can be used to
model and study the functional relationships between genes, proteins, and other
types of omics data. If a Gaussian graphical model is assumed, a gene
association network can be determined from the non-zero entries of the inverse
covariance matrix of the data. Due to the high-dimensional nature of such
problems, integrative methods that leverage similarities between multiple
graphical structures have become increasingly popular. The joint graphical
lasso is a powerful tool for this purpose; however, the current AIC-based
selection criterion used to tune the network sparsities and similarities leads
to poor performance in high-dimensional settings. We propose stabJGL, which
equips the joint graphical lasso with a stable and accurate penalty parameter
selection approach that combines the notion of model stability with
likelihood-based similarity selection. The resulting method makes the powerful
joint graphical lasso available for use in omics settings, and outperforms the
standard joint graphical lasso, as well as state-of-the-art joint methods, in
terms of all performance measures we consider. Applying stabJGL to proteomic
data from a pan-cancer study, we demonstrate the potential for novel
discoveries the method brings. A user-friendly R package for stabJGL with
tutorials is available on GitHub at https://github.com/Camiling/stabJGL
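The statement that a network can be read off the non-zero entries of the inverse covariance matrix can be illustrated with a small sketch. This is not stabJGL or the joint graphical lasso: there is no penalty and no estimation from data, only an exact inversion of an assumed three-variable chain. The variable names g1 to g3, the matrix, and the tolerance are hypothetical.

```python
from itertools import combinations

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity.
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def edges_from_precision(precision, names, tol=1e-8):
    """Return pairs whose off-diagonal precision entry is non-zero."""
    return [(names[i], names[j])
            for i, j in combinations(range(len(names)), 2)
            if abs(precision[i][j]) > tol]

# Assumed chain g1 - g2 - g3: g1 and g3 are conditionally independent.
true_precision = [[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]]
covariance = invert(true_precision)   # dense: g1 and g3 are correlated
recovered = invert(covariance)        # sparse again, up to rounding
edges = edges_from_precision(recovered, ["g1", "g2", "g3"])
print(edges)
```

The covariance between g1 and g3 is non-zero even though they share no edge, which is why the graph is defined through the precision matrix. In high-dimensional omics data the sample covariance is typically not invertible, which motivates penalised estimators such as the graphical lasso.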
Bayesian regression discontinuity designs: Incorporating clinical knowledge in the causal analysis of primary care data
The regression discontinuity (RD) design is a quasi-experimental design that
estimates the causal effects of a treatment by exploiting naturally occurring
treatment rules. It can be applied in any context where a particular treatment
or intervention is administered according to a pre-specified rule linked to a
continuous variable. Such thresholds are common in primary care drug
prescription where the RD design can be used to estimate the causal effect of
medication in the general population. Such results can then be contrasted to
those obtained from randomised controlled trials (RCTs) and inform prescription
policy and guidelines based on a more realistic and less expensive context. In
this paper we focus on statins, a class of cholesterol-lowering drugs; however,
the methodology can be applied to many other drugs provided these are
prescribed in accordance with pre-determined guidelines. NHS guidelines state
that statins should be prescribed to patients with 10-year cardiovascular
disease risk scores in excess of 20%. If we consider patients whose scores are
close to this threshold we find that there is an element of random variation in
both the risk score itself and its measurement. We can thus consider the
threshold a randomising device that assigns the prescription to units just above
the threshold and withholds it from those just below. Thus we are effectively
replicating the conditions of an RCT in the area around the threshold, removing
or at least mitigating confounding. We frame the RD design in the language of
conditional independence which clarifies the assumptions necessary to apply it
to data, and which makes the links with instrumental variables clear. We also
have context specific knowledge about the expected sizes of the effects of
statin prescription and are thus able to incorporate this into Bayesian models
by formulating informative priors on our causal parameters. Comment: 21 pages, 5 figures, 2 tables
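The idea of comparing units just above and just below the threshold can be sketched on simulated data. This is a plain least-squares illustration of a sharp design, not the Bayesian model with informative priors described in the abstract. Only the 20% threshold comes from the text; the bandwidth, effect size, outcome model, and noise level are assumptions chosen for illustration.

```python
import random

THRESHOLD = 0.20     # 10-year risk score cut-off quoted in the abstract
BANDWIDTH = 0.05     # assumed window around the threshold
TRUE_EFFECT = -1.0   # assumed treatment effect used to simulate outcomes

def simulate(n, seed=0):
    """Simulate (risk score, outcome) pairs under a sharp treatment rule."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        score = rng.uniform(0.10, 0.30)
        treated = score > THRESHOLD  # prescribed only above the threshold
        outcome = (4.0 + 2.0 * score + TRUE_EFFECT * treated
                   + rng.gauss(0.0, 0.3))
        data.append((score, outcome))
    return data

def fit_line(points):
    """Ordinary least squares; returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rd_estimate(data):
    """Difference between the two fitted lines at the threshold."""
    # Centre scores at the threshold so each intercept is the fitted
    # outcome exactly at the cut-off.
    below = [(s - THRESHOLD, y) for s, y in data
             if THRESHOLD - BANDWIDTH <= s <= THRESHOLD]
    above = [(s - THRESHOLD, y) for s, y in data
             if THRESHOLD < s <= THRESHOLD + BANDWIDTH]
    intercept_below, _ = fit_line(below)
    intercept_above, _ = fit_line(above)
    return intercept_above - intercept_below

estimate = rd_estimate(simulate(4000))
print(f"estimated jump at the threshold: {estimate:.2f}")
```

A narrower bandwidth keeps the comparison closer to the threshold but leaves fewer observations on each side, so the choice trades bias against variance.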
Statistical Methods in Integrative Genomics
Statistical methods in integrative genomics aim to answer important biological questions by jointly analyzing multiple types of genomic data (vertical integration) or aggregating the same type of data across multiple studies (horizontal integration). In this article, we introduce different types of genomic data and data resources, and then review statistical methods of integrative genomics, with emphasis on the motivation and rationale of these methods. We conclude with some summary points and future research directions.